(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 #2913
Conversation
FLUX.2-dev ModelOpt FP8 script:

BLOCKING: ModelOpt FP8 checkpoints should work in both offline and online serving.
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
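The threading pattern described above can be sketched with plain stand-in classes (names are hypothetical, not the actual HunyuanVideo15 modules): each wrapper forwards `quant_config` and a dotted `prefix` down to its children, so the ModelOpt adapter can resolve per-layer scales by name at load time, while modulation deliberately stays full precision.

```python
class Linear:
    """Stand-in for a vLLM quantizable linear layer (hypothetical)."""
    def __init__(self, in_f, out_f, quant_config=None, prefix=""):
        self.in_f, self.out_f = in_f, out_f
        self.quant_config = quant_config  # None => raw full-precision linear
        self.prefix = prefix              # dotted name used to bind per-layer scales

class Attention:
    """Stand-in for HunyuanVideo15Attention: forwards quant_config/prefix down."""
    def __init__(self, dim, quant_config=None, prefix=""):
        self.qkv = Linear(dim, 3 * dim, quant_config, f"{prefix}.qkv")
        self.out = Linear(dim, dim, quant_config, f"{prefix}.out")

class TransformerBlock:
    """Stand-in for HunyuanVideo15TransformerBlock."""
    def __init__(self, dim, quant_config=None, prefix=""):
        self.attn = Attention(dim, quant_config, f"{prefix}.attn")
        # Modulation stays raw: quant_config is deliberately NOT threaded through.
        self.modulation = Linear(dim, 6 * dim)

blocks = [
    TransformerBlock(64, quant_config={"quant_algo": "FP8"}, prefix=f"blocks.{i}")
    for i in range(2)
]
print(blocks[1].attn.qkv.prefix)  # blocks.1.attn.qkv
```

Because the prefix matches the checkpoint's parameter naming, the adapter from vllm-project#2913 can look up `blocks.1.attn.qkv`'s scale tensors directly.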
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, and disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8; see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
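The layer-skipping described above boils down to name-pattern matching; a minimal sketch (the pattern strings here are hypothetical illustrations, not the helper's actual skip list, which feeds into ModelOpt's quant config):

```python
import fnmatch

# Hypothetical skip patterns mirroring the commit message: modulation,
# embeddings, output proj, and token refiner stay full precision.
SKIP_PATTERNS = [
    "*modulation*",
    "*embed*",
    "*proj_out*",
    "*token_refiner*",
]

def should_quantize(name: str) -> bool:
    """Quantize a linear only if no precision-sensitive pattern matches its name."""
    return not any(fnmatch.fnmatch(name, p) for p in SKIP_PATTERNS)

layers = [
    "blocks.0.attn.qkv",
    "blocks.0.modulation.linear",
    "time_embed.linear_1",
    "proj_out",
]
quantized = [n for n in layers if should_quantize(n)]
print(quantized)  # ['blocks.0.attn.qkv']
```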
prompt: https://paste.ubuntu.com/p/ypkqDtNxQN/
The default export_hf_checkpoint() doesn't actually serialize weights as FP8 for unknown model types like HunyuanVideo15Transformer3DModel; it saves BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug. Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight per module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy the source minus transformer/, then save the quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
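The config.json patch step might look like the sketch below. The schema is an assumption modeled on ModelOpt's exported metadata: only quant_algo and config_groups are named in the commit, the remaining field names are illustrative.

```python
import json
import pathlib
import tempfile

def patch_config(config_path: pathlib.Path) -> dict:
    """Inject ModelOpt-style FP8 metadata so the serving adapter auto-detects it."""
    cfg = json.loads(config_path.read_text())
    cfg["quantization_config"] = {
        "quant_algo": "FP8",  # the key the adapter keys off
        "config_groups": {
            "group_0": {
                "weights": {"type": "float", "num_bits": 8, "strategy": "tensor"},
                "input_activations": {"type": "float", "num_bits": 8},
            }
        },
    }
    config_path.write_text(json.dumps(cfg, indent=2))
    return cfg

# Demonstrate on a throwaway config file.
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "config.json"
    p.write_text(json.dumps({"_class_name": "HunyuanVideo15Transformer3DModel"}))
    patched = patch_config(p)
print(patched["quantization_config"]["quant_algo"])  # FP8
```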
…block
When --weight-block-size 'M,N' is given, override the weight quantizer with block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor instead of a scalar. The patched config_groups advertises strategy='block' + block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this just works at serving time via vllm-project#2913's adapter.

Default behavior is unchanged (per-tensor); pass --weight-block-size 128,128 to opt in.

Signed-off-by: lishunyang <lishunyang12@163.com>
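The shape math for the per-block scales can be sketched in numpy. This illustrates the (out//M, in//N) layout only, not ModelOpt's actual quantizer; 448 as the FP8 e4m3 max magnitude is an assumption about the target format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable e4m3 magnitude (assumed target format)

def block_scales(weight: np.ndarray, M: int, N: int) -> np.ndarray:
    """One scale per MxN tile, i.e. block_sizes={-2: M, -1: N} -> (out//M, in//N)."""
    out_f, in_f = weight.shape
    assert out_f % M == 0 and in_f % N == 0
    # Reshape into (row-blocks, M, col-blocks, N) tiles, take per-tile amax.
    tiles = weight.reshape(out_f // M, M, in_f // N, N)
    amax = np.abs(tiles).max(axis=(1, 3))  # shape (out//M, in//N)
    return amax / FP8_E4M3_MAX

w = np.random.randn(256, 512).astype(np.float32)
scales = block_scales(w, 128, 128)
print(scales.shape)  # (2, 4)
```

With --weight-block-size 128,128 a (256, 512) linear therefore carries a (2, 4) scale tensor instead of a single scalar.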
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
Z-Image: offline
Force-pushed a9b3165 to 263be06.
For the e2e test:
We should have a unified model weight conversion script, like those in vllm-omni/vllm_omni/quantization/tools, plus a compare_diffusion_trajectory_similarity script. WDYT @baonudesifeizhai @lishunyang12
Quality outputs look good, but we have no perf numbers for any of the 5 models. Can you share:

I want to validate the perf story before merging.
After force_kernel=PerTensorTorchFP8ScaledMMLinearKernel on the vLLM side ...
https://paste.ubuntu.com/p/92yBc9x7bB/
Force-pushed a689e90 to 22fbfd5.
tests/diffusion/quantization/test_quantization_quality.py::test_quantization_quality[qwen_image_2512_modelopt_fp8_dynamic_all] PASSED [100%]
```diff
  self.shared_experts = None
- self.experts = SharedFusedMoE(
+ self.experts = FusedMoE(
```

We haven't validated this model yet, so we won't modify it for now.
```diff
  from vllm.inputs import MultiModalDataDict
  from vllm.logger import init_logger
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
```

Why should this be changed to FusedMoE?
```diff
  from vllm.entrypoints.pooling.embed.serving import ServingEmbedding as OpenAIServingEmbedding
- from vllm.entrypoints.pooling.pooling.serving import OpenAIServingPooling
+ from vllm.entrypoints.pooling.pooling.serving import ServingPooling as OpenAIServingPooling
  from vllm.entrypoints.pooling.score.serving import ServingScores
```

Why should we modify it here?
```diff
  import vllm.forward_context as _vllm_fc
- from vllm.model_executor.layers.fused_moe import SharedFusedMoE
+ from vllm.model_executor.layers.fused_moe import FusedMoE
```
```python
return rotary_position_embedding(x, cos, sin, rotated_mode="rotated_half", head_first=False, fused=True)
```

```python
def _ensure_batch_dim(x: torch.Tensor) -> tuple[torch.Tensor, bool]:
```

Why should we modify it here?
```python
quantization="fp8",
task="t2i",
prompt="a cup of coffee on a wooden table, morning light",
max_lpips=0.35,
```

I think the max_lpips threshold was set too arbitrarily before, and one metric isn't enough; we need to add metrics like PSNR or MAE to monitor it. I believe we should define this threshold properly first.
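PSNR and MAE are cheap to compute alongside LPIPS; a minimal sketch over uint8 images (not the repo's actual test utilities):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; inf when images are identical."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

def mae(ref: np.ndarray, test: np.ndarray) -> float:
    """Mean absolute error in raw pixel units."""
    return float(np.mean(np.abs(ref.astype(np.float64) - test.astype(np.float64))))

ref = np.zeros((4, 4), dtype=np.uint8)
test = ref + 2  # uniform pixel error of 2
print(round(psnr(ref, test), 2), mae(ref, test))  # 42.11 2.0
```

Unlike LPIPS, both are distribution-free, so their gating thresholds are easier to reason about when calibrating a quality test.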
```diff
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: Apache-2.0
```

I think this test should include accuracy-related tests; simply testing functionality is meaningless.
CLI:

```bash
python text_to_image.py --model <your-model> --quantization fp8
```

We should add modelopt.md and .nav.yml, such as https://docs.vllm.ai/en/latest/features/quantization/modelopt/, and follow the vllm-omni quantization style.
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
LGTM now. @lishunyang12 PTAL, thx
We may need to add a modelopt quantization script tool later. Thank you for your contribution.
vllm-project#2709 (vllm-project#2913) Signed-off-by: roG0d <rodgarcas98@gmail.com> Signed-off-by: roG0d <baonudesifeizhai@gmail.com> Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com> Co-authored-by: roG0d <rodgarcas98@gmail.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com> Signed-off-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>









PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
#2709
This PR adds Phase 1 support for ModelOpt FP8 diffusion checkpoints.
- Auto-detect `quantization_config` from diffusion checkpoint configs.
- Upgrade `fp8` stage configs to checkpoint-specific ModelOpt FP8 when serialized ModelOpt metadata is present.

Validation
Validated ModelOpt FP8 image generation on:
Benchmark Setup
All results below use the following settings unless otherwise noted:
- num-prompts=100
- request-rate=inf
- warmup-requests=0
- width=1024
- height=1024
- num-inference-steps=20
- seed=42

For online serving benchmarks, we use:
- max-concurrency=32

BF16 vs ModelOpt FP8
Offline vs Online
Observations
- HunyuanImage3, Qwen-Image-2512, Z-Image, and FLUX.2-klein-4B all show consistent gains from ModelOpt FP8 in both offline and online settings.
- FLUX.2-dev is the main exception in this set: ModelOpt FP8 reduces peak memory, but both offline and online throughput regress relative to BF16.
- The largest gains are on HunyuanImage3, with roughly a 21% throughput gain and a 16% mean latency reduction.

TODO
- FLUX.2-dev
- 'qwen image' latency
- BF16 activation → FP8 activation quantization → FP8 GEMM → BF16 output
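The activation dataflow in the last TODO item can be simulated per-tensor in numpy. This sketch emulates only the scaling round-trip (quantize to the FP8 range, GEMM in the scaled domain, rescale back); it does not model e4m3 rounding or the actual FP8 GEMM kernel, and 448 as the e4m3 max is an assumption about the format.

```python
import numpy as np

FP8_MAX = 448.0  # e4m3 max magnitude (assumed)

def quantize(x: np.ndarray):
    """Per-tensor scale into the FP8 dynamic range; e4m3 rounding omitted."""
    scale = np.abs(x).max() / FP8_MAX
    q = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    return q, scale

# BF16 activation -> FP8 activation quantization -> FP8 GEMM -> BF16 output
a = np.random.randn(8, 16).astype(np.float32)  # activation (stands in for BF16)
w = np.random.randn(16, 4).astype(np.float32)  # pre-quantized weight
qa, sa = quantize(a)
qw, sw = quantize(w)
out = (qa @ qw) * (sa * sw)  # GEMM in the scaled domain, then rescale back
print(np.allclose(out, a @ w, atol=1e-3))  # True (no rounding simulated)
```

In the real path the rounding step introduces the quantization error; this round-trip only shows where the scales enter and leave the GEMM.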
--
Test Plan
ModelOpt FP8 for qwen-image:
https://paste.ubuntu.com/p/gby859n2Qt/
hunyuan ModelOpt FP8: https://paste.ubuntu.com/p/dTgpmNzw3K/
```bash
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_steps20.log
```
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.